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Abstract 

This  work  investigates  the  application  of  evolutionary  programming,  a  stochastic  search  technique,  for 
simultaneously  determining  the  weights  and  the  number  of  hidden  units  in  a  fully-connected,  multi-layer  neural 
network.  The  simulated  evolution  search  paradigm  provides  a  means  for  optimizing  both  network  structure  and 
weight  coefficients.  Orthogonal  learning  is  implemented  by  independently  modifying  network  structure  and  weight 
parameters.  Different  structural  level  search  strategies  are  investigated  by  comparing  the  training  processes  for 
the  3-bit  parity  problem.  The  results  indicate  that  evolutionary  programming  provides  a  robust  framework  for 
evolving  neural  networks. 


Introduction 

The  traditional  neural  network  design  generally  consists  of  a  fixed  architecture.  The  static  structure  determined  by 
the  network  designer  can  impose  constraints  on  the  network  which  may  not  allow  for  an  adequate  solution  or  can 
result  in  an  unnecessarily  complex  arrangement  with  excess  parameters.  This  work  investigates  the  application  of 
evolutionary  programming  for  simultaneously  optimizing  the  network  structure  as  well  as  the  weight  coefficients. 
The  network  weights  are  stochastically  modified  using  Gaussian  mutations  proportional  to  the  network's  objective 
function.  Three  mutation  strategies  are  investigated  in  the  optimization  of  the  network  architecture.  The  first 
method  incorporates  the  standard  deviation  of  the  hidden  unit  activation  levels  over  all  of  the  training  patterns  into 
the  stochastic  search  process.  As  a  comparison,  the  second  and  third  methods  are  purely  random  approaches. 

Evolutionary  programming  (EP)  is  a  neo-Darwinian,  multi-agent,  global  optimization  paradigm  that  provides  a 
systematic  means  for  employing  stochastic  search.  EP  can  be  applied  to  the  neural  network  design  process  to 
determine  both  the  network  structure  as  well  as  the  network  parameters.  This  investigation  focuses  on  optimizing 
the  number  of  hidden  units  in  a  fully-connected  feedforward  network.  A  potential  benefit  of  evolving  near 
minimal  size  neural- networks  is  that  hardware  implementations  or  software  emulations  may  realize  a  significant 
increase  in  throughput  for  designs  which  require  near  real-time  capability. 

Previous  work1  has  utilized  EP  to  reduce  the  number  of  connections  and,  as  a  result,  the  number  of  active  hidden 
units.  Others  have  incorporated  variable  network  structures  into  the  training  process.  Ash2  has  developed  the 
dynamic  node  creation  algorithm  which  created  new  nodes  in  the  hidden  layer  when  the  training  error  rate  fell 
below  an  arbitrarily  chosen  critical  value.  Hirose  et  al.3  use  a  similar  approach  for  node  creation,  but  also  remove 
nodes  when  small  error  values  are  attained.  Aylward  and  Anderson4  postulate  a  set  of  rules  based  on  error  rate, 
convergence  criterion  and  distance  to  the  desired  error  level.  If  the  rules  are  not  continually  satisfied,  then  a  new 
node  is  added.  Fahlman  and  Lebiere3  add  additional  units  which  are  fully  connected  from  previously  generated 
hidden  units  and  the  input  neurons.  Their  cascade-correlation  architecture  selects  the  best  additional  hidden  unit 
from  a  pool  of  candidate  units  in  minimizing  the  network  error.  Lee6  generates  additional  neurons  based  on 
temporal  changes  in  a  neuron's  weight  space  and  removes  neurons  based  on  activity  variance  and  the  distance 
between  similar  weight  parameter  vectors.  While  not  as  sophisticated  as  the  techniques  used  by  Lee6,  the  approach 
employed  in  this  work  does  not  require  an  overt  amount  of  computation  for  a  single  network,  but  instead  exploits 
the  stochastic  multi-agent  search  process  to  evolve  a  single  suitable  network  from  a  multitude  of  candidate 
networks  which  constitute  a  population. 
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Figure  1.  One  generation  of  the  evolutionary  search  process. 
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The  EP  algorithm  utilized  in  this  work  is  based  on  the  paradigm  given  by  Fogel7.  A  modification  was  made  to 
account  for  evolving  disparate  quantities  such  as  the  network  weights  and  the  number  of  hidden  units.  The 
following  steps  describe  the  EP  optimization  algorithm  employed  herein: 


1.  Form  an  initial  population  Pjp.ifw)  of  size  3N.  The  weigjht  coefficients  w  associated  with  each  parent  element 
P,  are  randomly  initialized 

2.  Assign  a  cost  Jt  to  each  element  Pfw)  in  the  population  based  on  the  objective  function. 

3.  Reorder  the  population  based  on  the  number  of  wins  generated  from  a  stochastic  competition  process. 

4.  Generate  offspring  (PN....  from  the  highest  ranked  N  elements  (PQ ....  in  the  population.  The  first 

set  of offspring  (PN ....  Pw-ii  generated  by  perturbing  the  weights  w,  and  the  second  set  of  offspring  (P^ .... 
Pjfj.j)  are  generated  by  the  stochastics  governing  the  selection  of  the  hidden  nodes  (as  discussed  below). 

5.  Loop  to  step  2. 


Figure  1  illustrates  the  evolutionary  search  process  over  a  single  generation.  Other  techniques  (such  as 
backpropagation)  could  also  have  been  implemented  to  generate  the  offspring  with  a  new  weight  set 

Initially,  a  population  consisting  of  3N  feedforward  networks  is  generated.  The  weights  for  each  network  are 
instantiated  from  a  uniform  U(-0.5,  0.5)  distribution.  The  maximum  number  of  hidden  neurons  are  selected  from  a 
user  specified  search  domain.  The  objective  function /is  a  linearly-weighted  combination  of  the  number  of  hidden 
nodes  N„  and  mean  sum-squared  pattern  error  E  and  is  determined  according  to  J  =  aE  +  To  emulate  the 
probabilistic  nature  of  survival,  a  competition  process  takes  place  where  individual  elements  compete  against 
randomly  chosen  members  of  the  population.  A  network  receives  a  "win*  for  every  comparison  with  an  inferior 
network  (/.«.,  inferiority  equals  higher  cost).  The  N  networks  with  the  most  wins  are  retained  and  used  to  generate 
offspring  networks. 

The  first  set  of  offspring  networks  are  generated  simply  by  perturbing  the  parent  set  of  weights  according  to 
Wp-Wp  +  &Wp  where  SWp  is  a  multivariate  normally  distributed  random  variable  parameterized  by  N(0,  S^-J  ) 
with  a  scaling  coefficient  5^  and  an  objective  function  Jp  for  each  parent  network.  The  scaling  factor  can  be 
considered  the  probabilistic  analog  to  the  step  size  used  in  gradient  based  methods.  A  second  set  of  offspring  are 
generated  by  modifying  the  number  of  hidden  elements  which  are  active.  This  is  done  by  activating  or  deactivating 
the  outputs  of  randomly  selected  hidden  neurons.  Three  different  strategies  to  network  optimization  are 
investigated  in  this  work.  The  first  method  is  based  on  the  variance  in  the  activation  level  of  a  randomly  selected 
neuron.  If  the  selected  hidden  node  is  active,  it  is  deactivated  in  a  probabilistic  manner  based  on  the  standard 
deviation  cr  of  the  activation  level  as  determined  over  all  input  patterns  for  a  given  set  of  weights.  The  deactivation 
probability  is  given  by  Pjm  1  where  gr(r|.ff  is  the  maximum  deviation  of  all  the  active  neurons.  If  the 

selected  node  i  is  inactive,  then  its  probability  of  being  activated  is  determined  by  Pa  =  /a^  where  = 


max(  cr„cr _ J  so  that  Pa  <  1.  The  second  method  changes  the  state  of  a  randomly  selected  neuron  The  third 

method  changes  the  state  of  a  randomly  selected  neuron  50%  of  the  time.  Even  if  a  neuron  is  deactivated,  its 
weights  will  continue  to  be  updated  in  the  event  that  the  neuron  later  becomes  active.  Bias  nodes  always  remain 
active.  After  the  competition  process,  the  stochastic  search  procedure  is  repeated. 

« 

In  the  ensuing  investigation,  the  standard  deviation  method  outlined  above  is  referred  to  as  Strategy  1,  the  purely 
random  method  is  referred  to  as  Strategy  2,  and  the  purely  random  technique  with  an  additional  50%  probability  of 
changing  a  neuron's  state  of  activation  is  referred  to  as  Strategy  3.  Strategy  3  was  incorporated  to  investigate  the 
effects  of  artificially  constraining  the  search  versus  the  more  rigorous  search  that  takes  place  when  a  randomly 
selected  neuron's  activation  state  is  deterministically  changed. 

Results 

Training  trails  for  the  3-bit  parity  mapping  were  conducted  for  all  three  strategies.  The  runs  were  arbitrarily 
stopped  after  500  generations.  Each  network  was  initialized  with  the  maximum  20  hidden  nodes  fully  activated. 
All  of  the  runs  were  conducted  for  10  parent  networks  and  20  offspring  networks:  10  offspring  networks  with 
perturbed  weight  sets  and  10  offspring  networks  with  potential  for  structural  modifications.  The  objective  function 
is  given  by  J  »  £  +  0.01  *Nn  and  the  mutation  scaling  factor  was  set  as  Sj  =  50.  Figure  2  shows  the  MSE  and 
number  of  nodes  of  the  network  with  the  best  (/.«.,  lowest  cost)  J  at  each  generation  from  a  sample  training  run 
using  Strategy  1.  Note  that  over  the  first  50  generations  the  optimization  process  is  primarily  reducing  the  number 
of  hidden  nodes.  After  90  generations  the  network  has  converged  to  its  final  form  of  3-3-1. 

To  evaluate  the  performance  of  the  three  different  strategies,  comparisons  are  made  for  the  best  network  at  each 
generation  averaged  over  ten  trail  runs.  The  average  MSE  of  fee  best  network  for  each  strategy  is  shown  in  Figure 
3.  Strategies  1  and  2  converge  to  roughly  the  same  MSE  by  500  generations  while  Strategy  3,  the  artificially 
constrained  search,  has  an  average  MSE  which  is  4x  greater  due  to  poorer  convergence  characteristics.  The 
average  number  of  nodes  of  the  best  network  for  each  strategy  is  shown  in  Figure  4.  Both  Strategies  1  and  2 
converge  to  approximately  fee  same  number  of  active  nodes  by  150  generations.  The  average  cost  of  fee  best 
network  for  each  of  the  strategies  is  shown  in  Figure  5.  Strategies  1  and  2  are  very  similar  above  150  generations 
while  Strategies  2  and  3  are  roughly  fee  same  below  100  generations.  There  appears  to  be  a  short  transition  phase 
over  which  fee  Strategy  2  search  reconnects  neurons  which  are  significant  to  the  networks  performance.  The 
Strategy  3  search  is  slower  to  achieve  similar  results  due  to  fee  artificial  constraints  imposed  on  modifying  the 
network  structure.  The  variance  based  search  does  not  place  an  excess  emphasis  on  the  number  of  active  neurons 
and  is  (on  average)  primarily  disconnecting  hidden  units. 


Figure  2.  A  sample  training  trial  using 
Strategy  1. 


Figure  3.  Tke  average  MSE  for  the  best  network 
at  each  generation. 


Figure  4 L  The  average  number  of  hidden  units 
for  the  best  network  at  each  generation. 


HgureS.  The  average  cost  of  the  best  network 
at  each  generation  forJ~E+  0.01Ne. 


Coachuioa 

Although  a  small  sample  size  was  used  to  determine  the  performance  characteristics  using  three  different  network 
structure  level  modification  strategies,  the  results  appear  to  indirate  that  artificially  constraining  the  evolutionary 
search  process  impedes  the  construction  of  superior  networks.  Incorporating  knowledge  into  the  search  process 
may  be  advantageous  in  directing  the  search,  but  also  comes  with  added  computational  cost  which  was  not 
accounted  for  in  this  investigation.  In  contrasting  the  purely  random  approach  with  the  variance-based  strategy, 
there  is  no  significant  difference  fay  the  final  generation.  As  a  result,  the  simulated  evolution  paradigm  appears  to 
be  a  robust  approach  for  simultaneously  determining  the  number  of  hidden  nodes  and  weight  coefficients.  Further 
investigations  should  be  undertaken  in  analyzing  the  effects  of  the  cost  function  on  stochastic  search  process. 
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